Abstract: Coronary heart disease (CHD), caused by the hardening
of artery walls due to cholesterol (atherosclerosis), is
responsible for a large number of deaths worldwide. The disease
progression is slow and asymptomatic, and may lead to sudden
cardiac arrest, stroke or myocardial infarction. Presently, imaging
techniques are being employed to understand the molecular and
metabolic activity of atherosclerotic plaques to estimate the risk.
Though imaging methods are able to provide some information
on plaque metabolism, they lack the required resolution and
sensitivity for detection. In this paper we consider the clinical
observations and habits of individuals for predicting the risk
factors of CHD. The identification of risk factors helps in
stratifying patients for further intensive tests such as nuclear
imaging or coronary angiography. We present a novel approach
for predicting the risk factors of atherosclerosis with an in-built
imputation algorithm and particle swarm optimization (PSO).
We compare the performance of our methodology with other
machine learning techniques on the STULONG dataset, which is
based on a longitudinal study of middle-aged individuals lasting
twenty years. Our methodology, powered by PSO search, has
identified physical inactivity as one of the risk factors for the
onset of atherosclerosis, in addition to other already known
factors. The decision rules extracted by our methodology are able
to predict the risk factors with an accuracy of 99.73%, which is
higher than the accuracies obtained by the state-of-the-art
machine learning techniques presently employed in studies
identifying atherosclerosis risk.

Index Terms: Atherosclerosis, Classification, Risk factors,
Prediction, Imputation, Feature selection, Particle swarm
optimization, Decision trees

I. Introduction 

ATHEROSCLEROSIS is a systemic, chronic disease characterized
by the accumulation of inflammatory cells and
lipids in the inner lining of the arteries. It is a leading
cardiovascular disease causing deaths worldwide ID, El. Four
million Americans have survived a stroke and lead disabled
lives. More than 1 out of 3 (83 million) U.S. adults currently
live with one or more types of cardiovascular diseases 0.
In 2010, the total amount spent on cardiovascular diseases in
the United States was estimated to be $444 billion 0. The
number of deaths due to coronary artery disease in India was
projected to increase from 1.591 million in the year 2000 to
2.034 million by the year 2010 0.

Atherosclerosis has a long asymptomatic phase with a subclinical
incubation period ranging from 30 to 50 years.
Physicians would like to assess the risk of patients having a
severe clinical event such as stroke or heart attack and predict
the same if possible. Some of the major known risk factors
that eventually lead to the development of atherosclerosis are
as follows: (i) family history of premature coronary heart
disease or stroke in a first degree relative under the age of
60, (ii) tobacco abuse, (iii) type II diabetes, (iv) high blood
pressure, (v) left ventricular hypertrophy, (vi) high triglycerides,
(vii) high low-density lipoprotein (LDL) cholesterol,
(viii) low high-density lipoprotein (HDL) cholesterol, (ix) high
total cholesterol. A large number of factors influence the onset
of atherosclerosis, making it difficult for physicians
to diagnose in its early stages. Though the main risk factors are
identified, the development of automated, effective risk prediction
models using data mining techniques becomes essential for
better health monitoring and prevention of deaths due to
cardiovascular diseases 0, 0. Establishing a diagnostic
procedure for early detection of atherosclerosis is very
important, as any delay would increase the risk of serious
complications or even disability. Determining the conditions
(risk factors) predisposing the development of atherosclerosis
can lead to tests for identifying the disease in its early stages.

Though the clinical risk factors of CHD are identified,
there is a need for additional understanding of the disease
progression for effective management 0. Researchers have
been focusing on techniques for quantifying atherosclerosis
plaque morphology, composition, mechanical forces etc.,
hoping for better patient screening procedures. Imaging of
atherosclerotic plaques helps in both diagnosis and monitoring
of the progression for future management 0, 0.
There are two modes of imaging atherosclerosis: (i) invasive
and (ii) non-invasive. Among the invasive methods, X-ray
angiography is the gold standard imaging technique even
though it has certain limitations in providing information
on plaque composition (SB. The invasive procedures such
as intravascular ultrasound (IVUS) [11] and angioscopy [12]
help in understanding the plaque size and, to a limited extent,
its composition. Intravascular thermography [13]
aids in monitoring the changes in plaque composition and
metabolism. The non-invasive procedures such as B-mode
ultrasound sm, computerized tomography (CT) El and magnetic
resonance imaging (MRI) [16] can provide information
on plaque composition in vascular beds, but they fail to throw
much light on the metabolic activity of the plaque inflammatory
cells. Though nuclear imaging techniques such as single
photon emission computed tomography (SPECT) and positron
emission tomography (PET) El have the potential for 2D
and 3D surface reconstruction of thrombus using radio labels
to provide information on molecular, cellular and metabolic
activity of plaques E), they lack the required resolution and
sensitivity for detection and functional assessment in the medium
to small size arteries found in the coronary circulation.

Keeping in view the limitations of the imaging techniques,
there is still a great need for developing automated methods
for predicting the risk factors of atherosclerosis in
individuals, which would be of great help in reducing
disease related deaths. Machine learning approaches have been
employed in a variety of real world problems to extract knowledge
from data for predictive tasks. The presence of a large
number of attributes in medical databases affects the decision
making process, as some of the factors may be redundant or
irrelevant. Also, the presence of missing values and highly
skewed value distributions in the attributes of medical datasets
requires the development of new preprocessing strategies. Feature
selection methods are aimed at identifying feature subsets
to construct models that can best describe the dataset. The
other advantages of using feature selection methods include
identifying and removing redundant/irrelevant features El,
reducing the dimensionality of the dataset, and improving
the predictive capability of the classifier. The present study
attempts to identify the risk factors causing atherosclerosis and
the level of risk that individuals face.

In view of the above challenges, we present the following 
novel features of our work: 


• identifying the missing values (MV) in the dataset 
and imputing them by using a newly developed non- 
parametric imputation procedure; 

• determining factors that would help in predicting the risk 
of developing atherosclerosis among the different groups 
in the community; 

• building a predictive model that has the capability of
rendering effective prediction of risk factors in real time;

• comparing the performance of our methodology with 
other state-of-the-art methods used in the identification 
of risk factors of atherosclerosis; 

• estimating the time complexity and scalability of our new 
methodology. 


This paper is organized as follows. A brief survey of the
state-of-the-art techniques employed in predicting the risk
factors and understanding the biology of atherosclerosis is
presented in Section II, while in Section III we discuss a
novel methodology for predicting the clinical risk factors of
atherosclerosis. The description of the datasets, experiments
and results is presented in Section IV. The conclusions
are placed in Section V and the discussion is deferred to
Section VI.


II. A Brief Survey of State-of-the-Art Techniques

In [20] a test for predicting atherosclerosis is proposed using
a genetic algorithm and a fitness function that depends on the area
under the curve (AUC) of the receiver operating characteristic
(ROC). In ED a three-step approach based on clustering,
supervised classification and frequent itemset search is adopted
to predict whether a patient can develop atherosclerosis according
to the correlation between his or her habits and the social
environment. Support vector machines were employed in [22]
for discriminating between patients with coronary and non-coronary
heart disease. Supervised classifiers such as Naive Bayes
(NB), Multi Layer Perceptron (MLP) and Decision Trees (DT)
utilize the associations among the attributes for predicting
future cardiovascular disorders in individuals [23]. A
correlation based feature selection with the C4.5 decision tree
is applied in [24] for risk prediction of cardiovascular disease.
Recent developments in imaging methods for diagnosis have
given new insights into the molecular and metabolic activity of
atherosclerotic plaques E2i, in. m. There are many other
studies wherein machine learning techniques have also been
employed for predicting the risk of CHD due to atherosclerosis
using ultrasound and other imaging methods E3. ESI. Community
based studies help in understanding the risk factors of
atherosclerosis in different social strata ESI, EDI.

In the above studies the missing values (MV) were either
deleted M, or filled with approximate values [20].
A windowing method was employed ED for obtaining the
aggregates of the attributes for imputation of MVs. These
approaches would lead to biased estimates and may either
reduce or exaggerate the statistical power. Methods such
as logistic regression, maximum likelihood and expectation
maximization have been employed for imputation of MV, but
they can be applied only to data sets that are either nominal
or numeric. There are other imputation methods such as
k-nearest neighbor imputation (KNNI) [32], k-means clustering
imputation (KMI) [33], weighted k-nearest neighbor imputation
(WKNNI) [34] and fuzzy k-means clustering imputation
(FKMI) ES.

III. Novel Methodology for Predicting Risk Factors of Atherosclerosis

The mean value imputation proposed by Sree Hari Rao
and Naresh Kumar [35] can be employed only when the
attribute values are normally distributed. In the case of highly
skewed attribute values the above method may result in biased
estimates, as the mean value is not a true representative of a
non-normal distribution. Motivated by the above issues, we
propose a methodology comprising a novel nonparametric
missing value imputation method that can be applied to (i)
data sets consisting of attributes that are of the type categorical
(nominal) and/or numeric (integer or real), and (ii) attribute
values that belong to a highly skewed distribution. The methodology
proposed by Freund and Mason [36] ignores missing
values while generating the decision tree, which renders lower
prediction accuracies. In this paper we propose a new feature
subset selection methodology wherein a particle swarm
optimization (PSO) search is wrapped around an alternating
decision tree (ADT) embedded with the new imputation strategy
discussed in Section III-B for the generation of effective decision
rules. This methodology can predict the diagnosis of CHD in real
time. In fact, the decision rules obtained by employing this
novel methodology will be useful to diagnose other individuals
based on their risk factors. We designate the present machine
learning approach as the predictive risk assessment of
atherosclerosis (PRAA) methodology throughout this work.

We adopt the procedure suggested by Sree Hari Rao and
Naresh Kumar [35] for building and evaluating the classification
model using an ADT.


A. Data Representation 

A medical dataset can be represented as a set $S$
having row vectors $(R_1, R_2, \ldots, R_m)$ and column vectors
$(C_1, C_2, \ldots, C_n)$. Each record can be represented as
an ordered $n$-tuple of clinical and laboratory attributes
$(A_{i1}, A_{i2}, \ldots, A_{in})$ for each $i = 1, 2, \ldots, m$, where
the last attribute $A_{in}$, for each $i$, represents the physician's
diagnosis to which the record $(A_{i1}, A_{i2}, \ldots, A_{i(n-1)})$ belongs,
and without loss of generality we assume that there are no
missing elements in this set. Each attribute of an element in
$S$, that is $A_{ij}$ for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n-1$, can
be of either categorical (nominal) or numeric (real or integer)
type. Clearly all the sets considered are finite sets.


B. New Imputation Strategy 

The first step in any imputation algorithm is to compute
the proximity measure in the feature space among the clinical
records to identify the nearest neighbors from which the values
can be imputed. The most popular metric for quantifying the
similarity between any two records is the Euclidean distance.
Though this metric is simple to compute, it is sensitive to the
scales of the features involved. Further, it does not account for
correlation among the features. Also, the categorical variables
can only be quantified by counting measures, which calls
for the development of effective strategies for computing the
similarity [37]. Considering these factors, we first propose
a new indexing measure $I_{C_l}(R_i, R_k)$ between two typical
elements $R_i, R_k$ for $i, k = 1, 2, \ldots, m$, $l = 1, 2, \ldots, n-1$,
belonging to the column $C_l$ of $S$, which can be applied to any
type of data, be it categorical (nominal) and/or numeric (real).
We consider the following cases:

Case I: $A_{in} = A_{kn}$

Let $A$ denote the collection of all members of $S$ that
belong to the same decision class to which $R_i$ and $R_k$
belong. Based on the type of the attribute to which the
column $C_l$ belongs, the following situations arise:

(i) Members of the column $C_l$ of $S$, i.e.
$(A_{1l}, A_{2l}, \ldots, A_{ml})^T$, are of nominal or categorical or
integer type:

We now express $A$ as a disjoint union of non-empty
subsets of $A$, say $B_{\gamma_{p_{1l}}}, B_{\gamma_{p_{2l}}}, \ldots, B_{\gamma_{p_{sl}}}$, obtained
in such a manner that every element of $A$ belongs
to one of these subsets and no element of $A$ is
a member of more than one subset of $A$. That is,
$A = B_{\gamma_{p_{1l}}} \cup B_{\gamma_{p_{2l}}} \cup \ldots \cup B_{\gamma_{p_{sl}}}$, in which
$\gamma_{p_{1l}}, \gamma_{p_{2l}}, \ldots, \gamma_{p_{sl}}$ denote the cardinalities of the
respective subsets $B_{\gamma_{p_{1l}}}, B_{\gamma_{p_{2l}}}, \ldots, B_{\gamma_{p_{sl}}}$ formed out
of the set $A$, with the property that each member of
the same subset has the same first co-ordinate and
members of no two different subsets have the same
first co-ordinate. We define an index $I_{C_l}(R_i, R_k)$ for
each $l = 1, 2, \ldots, n-1$:

$$I_{C_l}(R_i, R_k) = \begin{cases} \min\left\{\dfrac{\gamma_{p_{il}}}{\gamma}, \dfrac{\gamma_{q_{kl}}}{\gamma}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise,} \end{cases}$$

where $\gamma_{p_{il}}$ represents the cardinality of the subset
$B_{\gamma_{p_{il}}}$, all of whose elements have first co-ordinates
$A_{il}$, $\gamma_{q_{kl}}$ represents the cardinality of that subset
all of whose elements have first co-ordinates $A_{kl}$, and
$\gamma = \gamma_{p_{1l}} + \gamma_{p_{2l}} + \ldots + \gamma_{p_{sl}}$ represents the cardinality
of the set $A$.

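(ii) Members of the column $C_l$ of $S$, i.e.
$(A_{1l}, A_{2l}, \ldots, A_{ml})^T$, are of real type (fractional or
non-integer numbers):

We consider the set $P_l$ for $l = 1, 2, \ldots, n-1$,
which is a collection of all the members of the
column $C_l$. We then compute the skewness measure

$$sk(P_l) = \frac{\frac{1}{m}\sum_{i=1}^{m}\left(A_{il} - \bar{A}_{il}\right)^3}{\left(\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(A_{il} - \bar{A}_{il}\right)^2}\right)^3},$$

where $\bar{A}_{il}$ denotes the mean of $A_{il}$ for each
$l = 1, 2, \ldots, n-1$. Define the sets
$M_l = \{a \in P_l \mid a < A_{il}, \text{ for } sk(P_l) < 0\}$ or
$M_l = \{b \in P_l \mid b > A_{il}, \text{ for } sk(P_l) > 0\}$, and
similarly $N_l = \{a \in P_l \mid a < A_{kl}, \text{ for } sk(P_l) < 0\}$ or
$N_l = \{b \in P_l \mid b > A_{kl}, \text{ for } sk(P_l) > 0\}$. Let $\tau_l$ and $\rho_l$
be the cardinalities of the sets $M_l$ and $N_l$ respectively.
Construct the index $I_{C_l}(R_i, R_k)$:

$$I_{C_l}(R_i, R_k) = \begin{cases} \min\left\{\dfrac{\tau_l}{\gamma}, \dfrac{\rho_l}{\gamma}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise.} \end{cases}$$

In the above definition $\gamma$ represents the cardinality of
the set $P_l$.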

Case II: $A_{in} \neq A_{kn}$

Clearly $R_i$ and $R_k$ belong to two different decision
classes. Consider the subsets $P_i$ and $Q_k$ consisting of
members of $S$ that share the same decision with $R_i$ and
$R_k$ respectively. Clearly $P_i \cap Q_k = \emptyset$. Based on the type
of the attribute of the members, the following situations
arise:

(i) Members of the column $C_l$ of $S$, i.e.
$(A_{1l}, A_{2l}, \ldots, A_{ml})^T$, are of nominal or categorical
type:

Following the procedure discussed in Case I item (i), we
write $P_l$ and $Q_l$ for each $l = 1, 2, \ldots, n-1$ as a disjoint
union of non-empty subsets $P_{\beta_{1l}}, P_{\beta_{2l}}, \ldots, P_{\beta_{rl}}$
and $Q_{\delta_{1l}}, Q_{\delta_{2l}}, \ldots, Q_{\delta_{sl}}$ respectively, in which
$\beta_{1l}, \beta_{2l}, \ldots, \beta_{rl}$ and $\delta_{1l}, \delta_{2l}, \ldots, \delta_{sl}$ indicate the
cardinalities of the respective subsets. We define the
indexing measure between the two records $R_i$ and $R_k$
as

$$I_{C_l}(R_i, R_k) = \begin{cases} \ldots \\ 0, & \text{otherwise,} \end{cases}$$

where $\beta_{rl}$ represents the cardinality of the subset $P_{\beta_{rl}}$,
all of whose elements have first co-ordinates $A_{il}$ in
set $P_l$, and $\delta_{sl}$ represents the cardinality of that subset
$Q_{\delta_{sl}}$, all of whose elements have first co-ordinates $A_{kl}$
in set $Q_l$.

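(ii) Members of the column $C_l$ of $S$, i.e.
$(A_{1l}, A_{2l}, \ldots, A_{ml})^T$, are of numeric type:

If the type of the attribute is an integer we follow the
procedure discussed in Case II item (i). For fractional
numbers we follow the procedure discussed in Case I
item (ii), and we define the set $P_l$ as the members of
column $C_l$ that belong to decision class $A_{in}$ and $Q_l$
as the members of column $C_l$ that belong to decision
class $A_{kn}$. We then compute the skewness measures

$$sk(P_l) = \frac{\frac{1}{m}\sum_{i=1}^{m}\left(A_{il} - \bar{A}_{il}\right)^3}{\left(\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(A_{il} - \bar{A}_{il}\right)^2}\right)^3},$$

where $\bar{A}_{il}$ denotes the mean of $A_{il}$ for each
$l = 1, 2, \ldots, n-1$, and

$$sk(Q_l) = \frac{\frac{1}{m}\sum_{i=1}^{m}\left(A_{kl} - \bar{A}_{kl}\right)^3}{\left(\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(A_{kl} - \bar{A}_{kl}\right)^2}\right)^3},$$

where $\bar{A}_{kl}$ denotes the mean of $A_{kl}$ for each
$l = 1, 2, \ldots, n-1$. We then construct the sets $T_l$ and $S_l$ as
follows: $T_l = \{a \in P_l \mid a < A_{il}, \text{ for } sk(P_l) < 0\}$
or $T_l = \{b \in P_l \mid b > A_{il}, \text{ for } sk(P_l) > 0\}$, and
similarly $S_l = \{a \in P_l \mid a < A_{kl}, \text{ for } sk(P_l) < 0\}$
or $S_l = \{b \in P_l \mid b > A_{kl}, \text{ for } sk(P_l) > 0\}$. We now
define the index $I_{C_l}(R_i, R_k)$ between the two records
$R_i, R_k$ as

$$I_{C_l}(R_i, R_k) = \begin{cases} \min\left\{\dfrac{\beta_l}{\lambda}, \dfrac{\delta_l}{\lambda}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise.} \end{cases}$$

In the above definition $\beta_l$ and $\delta_l$ represent the
cardinalities of the sets $T_l$ and $S_l$ respectively. The sum of
the cardinalities of the sets $P_l$ and $Q_l$ is represented
by $\lambda$.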
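
The same-class indices of Case I can be illustrated with a short Python sketch. This is not the authors' Matlab implementation: the function names are hypothetical, each column is assumed to be a plain list holding one attribute for the records of a single decision class, and zero skewness (which the definitions above leave open) is assumed to follow the positive-skew branch.

```python
from collections import Counter


def index_categorical(column, i, k):
    """Case I(i): frequency-based index for a nominal/categorical/integer column.

    `column` holds the values of one attribute for the records of a single
    decision class (the set A in the text).
    """
    if i == k:
        return 0.0
    counts = Counter(column)   # cardinality of each subset sharing a value
    gamma = len(column)        # cardinality of the whole set A
    return min(counts[column[i]] / gamma, counts[column[k]] / gamma)


def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    m = len(values)
    mean = sum(values) / m
    second = sum((v - mean) ** 2 for v in values) / m
    third = sum((v - mean) ** 3 for v in values) / m
    return third / second ** 1.5 if second > 0 else 0.0


def index_real(column, i, k):
    """Case I(ii): skewness-directed index for a real-valued column."""
    if i == k:
        return 0.0
    gamma = len(column)
    if skewness(column) < 0:
        tau = sum(1 for a in column if a < column[i])   # |M_l|
        rho = sum(1 for a in column if a < column[k])   # |N_l|
    else:
        # Assumption: zero skewness is handled like positive skewness.
        tau = sum(1 for b in column if b > column[i])
        rho = sum(1 for b in column if b > column[k])
    return min(tau / gamma, rho / gamma)
```

For the toy column ["a", "a", "b", "c"], records 0 and 2 receive the index min(2/4, 1/4) = 0.25, since the value "a" is shared by two of the four records and "b" by one.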

The proximity or distance scores between the clinical
records in the data set $S$ can be represented as

$$D = \{\{0, d_{12}, \ldots, d_{1m}\}; \{d_{21}, 0, \ldots, d_{2m}\}; \ldots; \{d_{m1}, d_{m2}, \ldots, 0\}\},$$

where $d_{ik} = \sqrt{\sum_{l=1}^{n-1} I_{C_l}(R_i, R_k)}$. For each of the missing
value instances in a record $R_i$, our imputation procedure first
computes the score

$$\alpha(x_k) = \frac{x_k - \mathrm{median}(x)}{\mathrm{median}_i\,|x_i - \mathrm{median}(x)|},$$

where $\{x_1, x_2, \ldots, x_n\}$ denote the distances of $R_i$ from $R_k$. We then
pick up only those records (nearest neighbors) which satisfy
the condition $\alpha(x_k) < 0$, where $\{d_{i1}, d_{i2}, \ldots, d_{im}\}$ denote
the distances of the current record $R_i$ to all other records
in the data set $S$. If the type of attribute is categorical or
integer, then the data value that has the highest frequency
(mode) of occurrence in the corresponding columns of the
nearest records is imputed. For the data values of type real
we first collect all non-zero elements in the set $D$ and
denote this set by $B$. For each element in set $B$ we compute
the quantity $P(j) = \frac{1}{B(j)}$ $\forall j = 1, \ldots, \gamma$, where $\gamma$ denotes the
cardinality of the set $B$. We compute the weight matrix as
$W(j) = \frac{P(j)}{\sum_{i=1}^{\gamma} P(i)}$ $\forall j = 1, \ldots, \gamma$. The value to be imputed
may be taken as $\sum_{i=1}^{\gamma} P(i) * W(i)$.
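
The neighbour selection and the categorical/integer branch of this step can be sketched in Python as follows. The sketch is illustrative only: the function names and the dictionary layout are assumptions, the real-valued branch is not covered, and only the sign of the difference between a distance and the median distance is evaluated, which decides the condition on the score whenever its denominator is positive.

```python
from collections import Counter
from statistics import median


def nearest_neighbours(distances):
    """Return ids of records whose distance lies below the median distance.

    `distances` maps a record id to its distance from the record being
    imputed. Only the sign of (x_k - median(x)) is used, which matches the
    condition alpha(x_k) < 0 whenever the score's denominator is positive.
    """
    med = median(distances.values())
    return [rid for rid, d in distances.items() if d - med < 0]


def impute_mode(neighbour_values):
    """Most frequent observed value among the neighbours (ties: first seen)."""
    observed = [v for v in neighbour_values if v is not None]
    if not observed:
        return None  # nothing to impute from
    return Counter(observed).most_common(1)[0][0]
```

With distances {"r1": 0.2, "r2": 0.5, "r3": 0.9, "r4": 0.4} the median is 0.45, so only r1 and r4 are retained as nearest neighbours.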


Algorithm 1 The PRAA Methodology

Input: (a) Data sets for the purpose of decision making S(m, n), where m and n are
the number of records and attributes respectively, and the members of S may have MV
in any of the attributes except in the decision attribute, which is the last attribute
in the record.
(b) The type of attribute C of the columns in the data set.

Output: (a) Classification accuracy for a given data set S.
(b) Performance metrics AUC, SE, SP.

Algorithm
(1) Identify and collect all records in a data set S.
(2) Impute the MV in the data set S using the procedure discussed in Section III-B.
(3) Extract the influential features using a wrapper based approach with particle swarm
optimization search for identifying feature subsets and ADT for its evaluation, as
discussed in Section III-C.
(4) Split the dataset into training and testing sets using a stratified k-fold cross
validation procedure. Denote each training and testing data set by Tk and Rk
respectively.
(5) For each k compute the following:
(i) Build the ADT using the records obtained from Tk.
(ii) Compute the predicted probabilities (scores) for both positive and negative
diagnosis of CHD from the ADT built in Step (5)-(i) using the test data set
Rk. Designate the set consisting of all these scores by P.
(iii) Identify and collect the actual diagnosis from the test data set Rk into a set
denoted by L.
(6) Repeat Steps (5)-(i) to (5)-(iii) for each fold.
(7) Obtain the performance metrics AUC, SE and SP utilizing the sets L and P.
(8) RETURN AUC, SE, SP.
(9) END.
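
Steps (4) and (7) of Algorithm 1 can be illustrated with a minimal Python sketch. The helper names are hypothetical, the fold assignment is a simple round-robin per class rather than any particular library routine, and the ADT itself is not implemented here; only the stratified split and the SE/SP computation are shown.

```python
def stratified_folds(labels, k):
    """Assign record indices to k folds so that each class is spread evenly."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    slot = 0
    for indices in by_class.values():
        for idx in indices:
            folds[slot % k].append(idx)   # round-robin within each class
            slot += 1
    return folds


def sensitivity_specificity(actual, predicted, positive=1):
    """SE = TP / (TP + FN), SP = TN / (TN + FP)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    return se, sp
```

Each fold returned by `stratified_folds` serves once as the test set, with the remaining folds forming the training set; the actual and predicted labels collected over all folds are then passed to `sensitivity_specificity`.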


C. Particle Swarm Optimization Search for Feature Subset Selection (Risk Factors)

A PSO search consists of a set of particles, each initialized with
a candidate solution to a problem. Each particle is associated
with a position vector and a velocity vector. The particles
evaluate the fitness of the solutions iteratively and store the
location where they had their best fit, known as the local best
(L). The particles change their position and velocity iteratively
in a suitable manner with respect to the best fit solution to
reach a global optimal solution. The best fit solution among
the particles is called the global best (G). We represent the
position vector of the particle as a binary string and use the
accuracy of the learning algorithm as the fitness function for evaluation.
The velocity and position vectors of the particles are modified 
using the procedure suggested in li49l . 
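
A wrapper search of this kind can be sketched in Python as below. The update rule is an assumption: it uses the common sigmoid-based binary PSO scheme, which may differ from the procedure the paper cites. The inertia weight `w`, the velocity clamp `v_max` and the toy fitness used in the usage note are likewise illustrative and not taken from the paper; in the methodology itself the fitness would be the accuracy of the ADT trained on the selected attributes.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=50, n_iterations=100,
               c1=2.0, c2=2.0, w=1.0, v_max=4.0, seed=0):
    """Search for a 0/1 feature mask that maximises `fitness(mask)`."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # local best (L) per particle
    best_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]   # global best (G)
    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (best_pos[i][d] - pos[i][d])
                     + c2 * r2 * (g_pos[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                # Sigmoid of the velocity gives the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit
```

As a usage example, `binary_pso(lambda mask: sum(mask) / len(mask), 6)` searches over six hypothetical attributes with a toy fitness that simply rewards selecting more of them; the defaults for the number of particles, the number of iterations and the two acceleration factors follow the parameter values reported in Section IV-B.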

IV. Experiments and Results 

A. Dataset 

In the present study we use the STULONG dataset [42],
which is a longitudinal primary preventive study of middle-aged
men lasting twenty years for assessing the risk of
atherosclerosis and cardiovascular health depending on personal
and family history, collected at the Institute of Clinical and
Experimental Medicine (IKEM) in Praha and the Medicine
Faculty at Charles University in Plzen (Pilsen). The STULONG
dataset is divided into four sub-groups, namely Entry,
Letter, Control and Death. The Entry dataset consists of
1417 patient records with 64 attributes having either codes or
results of size measurements of different variables or results
of transformations of the rest of the attributes during the
first level examination. We utilize the Entry, Control and
Death datasets for our predictive modeling. The Entry level
dataset is divided into four groups, (a) normal group (NG),
(b) pathological group (PG), (c) risk group (RG), and (d) not
allotted (NA) group, based on the studied group of patients
(KONSKUP) in (1, 2), (5), (3, 4), (6) respectively. We form
a new dataset by joining the Entry, Control and Death datasets
as follows: we write the identification number of a patient
(ICO) based on the selection criteria suggested in [43] and
determine the susceptibility of a patient to atherosclerosis
based on the attributes recorded in the Control and Death tables.
An individual is considered to have cardiovascular disease if
he or she has a history of heart disease (i.e., he or she has at least
one positive value on attributes such as myocardial infarction
(HODN2), cerebrovascular accident (HODN3), myocardial
ischaemia (HODN13), silent myocardial infarction (HODN14)),
or died of heart disease (i.e., the record appears in the
Death table with the PRICUMR attribute equal to 05 (myocardial
infarction), 06 (coronary heart disease), 07 (stroke), or 17
(general atherosclerosis)). Based on the above definition we
divide the Entry dataset into two datasets, DS1 and DS2,
depending on whether the patients are in the NG or RG group
respectively.
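
The disease definition above can be expressed as a small labelling function. This is a sketch under stated assumptions: each Control or Death row is assumed to be a dictionary keyed by attribute name, a "positive value" is assumed to be any number greater than zero, and the function name is hypothetical. Only the attribute names and the PRICUMR codes come from the text.

```python
HISTORY_ATTRIBUTES = ("HODN2", "HODN3", "HODN13", "HODN14")
CARDIAC_DEATH_CODES = {"05", "06", "07", "17"}


def has_cardiovascular_disease(control_records, death_record=None):
    """Apply the disease definition to one patient's Control/Death rows."""
    for record in control_records:
        # Missing or empty entries are treated as "not positive".
        if any((record.get(name) or 0) > 0 for name in HISTORY_ATTRIBUTES):
            return True
    if death_record is not None:
        # PRICUMR may be stored as text or number; normalise to two digits.
        code = str(death_record.get("PRICUMR", "")).zfill(2)
        return code in CARDIAC_DEATH_CODES
    return False
```

A patient with any positive history attribute in a Control row, or a Death row whose cause code is 05, 06, 07 or 17, is labelled as having cardiovascular disease; all other patients are labelled negative.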

B. Description of Experiments 

In our methodology we have employed a stratified ten-fold
cross validation ($k = 10$) procedure. We have applied a standard
implementation of SVM with a radial basis function kernel
using the LibSVM package [38]. We have taken the following
standard parameter values for PSO: (i) number of particles
$Z = 50$, (ii) number of iterations $G = 100$, (iii) cognitive
factor $c_1 = 2$, and (iv) social factor $c_2 = 2$. The standard
implementations of the C4.5, Naive Bayes (NB) and Multi Layer
Perceptron (MLP) algorithms in Weka® [39] are considered
for evaluating the performance of our algorithm. An implementation
of the correlation based feature selection (CFS) gol
algorithm with genetic search has been considered for comparison
with our methodology. Also, we have implemented the
PRAA algorithm and the performance evaluation methods
in Matlab®. A non-parametric statistical test proposed by
Wilcoxon El is used to compare the performance of the
algorithms.

C. Performance Measures and Results 

The PRAA methodology has outperformed (see Table I)
the classifiers C4.5, SVM, MLP and NB in terms of the sensitivity
(SE), specificity (SP) and AUC performance metrics.
In the risk group dataset DS2, our methodology identified
the patients affected by atherosclerosis with an accuracy of
99.73% using only 13 out of 51 attributes. The
wrapper based feature selection using PSO and ADT could
identify influential factors such as alcohol (ALKOHOL),
daily consumption of tea (CAJ), hypertension or ictus (ICT),
hyperlipoproteinemia (HYPLIP), the time since hypertension
(HT) appeared (HTTRV), the number of years since
hyperlipidemia appeared (HYPLTRV), blood pressure II systolic
(DIAST2), cholesterol in mg % (CHLST), glucose in urine
(MOC), obesity (OBEZRISK) and hypertension (HTRISK), which
are in conformity with other studies related to cardiovascular
diseases m, Ei, Ei. We have identified an important
fact: even in the normal group (DS1), individuals who mostly
remain seated without any physical activity
(AKTPOZAM=1) may develop atherosclerosis, as observed in El.


TABLE I Performance comparison of the PRAA with other
methodologies (C4.5, SVM, NB, MLP) on the data sets used
in the present study

Dataset   Method   Accuracy (%)   SE       SP       AUC
DS1       PRAA     98.04          93.75    100.00   0.94
DS1       C4.5     70.59          6.25     100.00   1.00
DS1       NB       52.94          56.25    51.43    0.53
DS1       SVM      35.29          100.00   5.71     0.53
DS1       MLP      66.67          0.00     97.14    0.97
DS2       PRAA     99.73          99.35    100.00   1.00
DS2       C4.5     50.00          55.48    45.97    0.52
DS2       NB       61.20          48.39    70.62    0.58
DS2       SVM      45.63          92.26    11.37    0.56
DS2       MLP      57.92          0.65     100.00   1.00


Fig. 1. Alternating Decision tree generated for dataset DS2

The ADT generated for the risk group (DS2) is shown in
Fig. 1. The following decision rules are extracted from the
decision tree shown in Fig. 1:

1) The risk of atherosclerosis increases by a factor of
-2.098 if an individual has been suffering from hyperlipidemia
for 1.5 years;

2) The presence of hypertension (< 3.5 years) would increase
the risk of atherosclerosis by a factor of -1.348;

3) The risk is estimated as -2.098 - 1.348 - 2.849 - 0.966 -
0.43 - 0.249 = -7.94 if an individual has been suffering from
hyperlipidemia for anywhere between 1.5 and 4
years (HYPLTRV > 1.5 and < 4.0), with hypertension
for 3.5 years and cholesterol levels less than 166. It is
observed that the presence of hyperlipoproteinemia and
glucose in urine would increase the risk of atherosclerosis.
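
The arithmetic in rule 3 follows the usual way an alternating decision tree scores a record: the prediction values of all nodes the record reaches are added, and the sign of the total decides the class. The Python sketch below only reproduces that sum. The function name is hypothetical, and which sign corresponds to which diagnosis is not stated in the text, so it is left open.

```python
def adt_score(prediction_values):
    """Total score of a record: the sum of the prediction values it reaches."""
    return sum(prediction_values)


# Prediction values quoted in rule 3.
rule_3 = [-2.098, -1.348, -2.849, -0.966, -0.43, -0.249]
print(round(adt_score(rule_3), 2))  # prints -7.94
```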

V. Conclusions 

A new methodology (PRAA) is discussed. It has built-in features
for the imputation of missing values, can be applied to datasets
whose attribute values are normally distributed and/or highly
skewed and whose attributes are categorical and/or numeric, and
identifies risk factors using wrapper based feature selection.
The PRAA methodology has outperformed
the state-of-the-art methodologies in determining the risk
factors associated with the onset of atherosclerosis.
The PRAA methodology has generated a decision tree with an
accuracy of 99.73% for dataset DS2. Based on the performance
measures we conclude that the use of PSO search for feature
subset selection, wrapped around an ADT embedded with the new
imputation strategy for fitness evaluation, has improved the
prediction accuracies.

VI. Discussion 

In this section we discuss the performance
of the PRAA methodology on benchmark datasets, its computational
complexity and scalability, and comparative studies with
other methodologies.

A. Performance Comparison of new imputation procedure on 
Benchmark Datasets 

Since no specific studies on imputation of missing values
in cardiovascular disease data sets are available in the literature,
we have utilized some benchmark data sets obtained
from the KEEL and University of California Irvine (UCI) machine
learning data repositories ED, ED to test the performance
of the new imputation algorithm. The Wilcoxon statistics in
Table II are computed by comparing the accuracies obtained by
the new imputation algorithm with those
obtained by using embedded algorithms for handling missing
values, such as the C4.5 decision tree. The results in Table II below
clearly demonstrate that the imputation procedure presented
in Section III-B has superior performance when compared to
other imputation algorithms, as the test statistics are well below
or equal to the critical values with p < 0.05 in all cases.


TABLE II Wilcoxon sign rank statistics for matched pairs
comparing the new imputation algorithm with other imputation
methods using C4.5 decision tree

Method   Rank Sums (+, -)   Test Statistics   Critical Value   p-value
FKMI     28.0, 0.0          0.0               3                0.02
KMI      28.0, 0.0          0.0               3                0.02
KNNI     15.0, 0.0          0.0               0                0.06
WKNNI    21.0, 0.0          1.0               18               0.03
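
The rank sums in Table II come from the Wilcoxon signed-rank procedure for matched pairs, which can be sketched in Python as follows. The function name and the accuracy figures in the usage note are hypothetical and do not reproduce the benchmark results; the sketch only shows how the positive and negative rank sums and the test statistic (the smaller of the two) are obtained.

```python
def signed_rank_sums(a, b):
    """Return (R+, R-, T) for paired samples a and b, with T = min(R+, R-)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences dropped
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(ordered):
        end = pos
        # Extend the group while the absolute differences are tied.
        while (end + 1 < len(ordered)
               and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[pos]])):
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[ordered[j]] = average_rank
        pos = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus, min(r_plus, r_minus)
```

For example, with the hypothetical accuracies a = [90, 85, 78, 92, 88, 95, 81] and b = [85, 84, 70, 90, 80, 91, 75], every difference favours a, so the function returns rank sums of 28 and 0 and a test statistic of 0. A positive rank sum of 28 against 0 is what seven non-tied pairs all favouring one method would give, since 1 + 2 + ... + 7 = 28.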


B. Comparison with other Related Methodologies on 
Atherosclerosis 

In this section we compare the results obtained in [23], [24],
[22] with the results of our new methodology on the risk group
dataset DS2. In [24], where CFS with genetic
search and C4.5 is employed, an SE of 39.4% and an SP of
82.8% are observed. In [22] SVM was used to classify the
patients with an SE of 95% and an SP of 90%. In [23] both NB
and MLP were used for classification with an accuracy of 80%,
and they obtained an SE of 92% and 82% and an SP of 53%
and 76% respectively. The new methodology, when applied to
the dataset DS2, resulted in an accuracy of 99.73%, an SE
of 99.35% and an SP of 100%, which is regarded as a good
classification model since both SE and SP are higher than
80%.



Fig. 2. Computational complexity of the PRAA 


C. Computational Complexity and Scalability of PRAA Algorithm

The computational complexity is a measure of the performance
of the algorithm, which can be measured in terms
of the CPU time (clock cycles elapsed, expressed in seconds) taken
to run the methodology on a dataset. For each data set
having $n$ attributes and $m$ records, we select only that subset
of $m_1 < m$ records in which missing values are present.
The distances are computed for all $n$ attributes excluding the
decision attribute. So, the time complexity for computing the
distance would be $O(m_1 * (n-1))$. The time complexity
for computing skewness is $O(m_1)$. The time complexity for
selecting the nearest records is of order $O(m_1)$. For computing
the frequency of occurrences for nominal attributes
and the weighted average for numeric attributes, the time taken
would be of the order $O(m_1)$. Therefore, for a given data
set with $k$-fold cross validation having $n$ attributes and $m$
records, the time complexity of our new imputation algorithm
would be $k * (O(m_1 * (n-1) * m) + 3 * O(m_1))$, which is
asymptotically linear. Our experiments were conducted on a
personal computer having an Intel(R) Core(TM) 2 Duo CPU
@ 2.93 GHz processor with 4 GB RAM, and the time taken by
PRAA for varying database sizes is shown in Fig. 2. We have
employed a linear regression on our results and obtained the
relation between the time taken ($T$) and the data size ($D$) as
$T = 14.909 D - 104.655$, $\alpha = 0.05$, $p = 0.0003$, $r^2 = 0.994$.
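
A line of this form can be obtained by ordinary least squares, as in the Python sketch below. The measured timings are not listed in the text, so the usage note generates points from the reported line itself purely to show that the fit recovers it; the function names are hypothetical and the units of D are left as in the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_time(d, slope=14.909, intercept=-104.655):
    """Evaluate the reported relation T = 14.909 D - 104.655."""
    return slope * d + intercept
```

Fitting the synthetic points (D, predicted_time(D)) for D = 10, 20, 30, 40 with `fit_line` returns the reported slope and intercept up to rounding error.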

The presence of the linear trend between the time taken and
the varying database sizes ensures the numerical scalability of
the performance of the PRAA methodology. A comparison with
other related methodologies used in the study of atherosclerosis
yields the conclusion that the new PRAA methodology
presented in this paper has superior performance over the other
methods studied in [23], [24], [22]. We hold the view that more
intensive and introspective studies of this kind will pave the way
for effective risk prediction and diagnosis of atherosclerosis.